GoogleLing: The Web as a Linguistic Corpus
Abstract
We describe software to transform any search engine or searchable corpus into a tool for linguistic research with a rich query syntax. We provide support for case-sensitive searches, within-sentence and within-N-words match constraints, part-of-speech restrictions on words, and “smart” verb-ending inflection wildcards. The software generalizes the query for the underlying search engine, and then processes the resulting pages with a set of natural language processing tools to extract matching sentences. Preliminary evaluation suggests that this greatly enhances linguists’ ability to use the web as a linguistic corpus.
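The generalize-then-filter idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`generalize`, `split_sentences`, `matches`, `extract`), the regex-based sentence splitter, and the token-gap definition of "within N words" are all assumptions made for illustration, and part-of-speech restrictions and inflection wildcards are left out. The rich query is relaxed into a lowercase keyword query that a generic engine could accept, and the stricter constraints are then checked sentence by sentence on the returned text.

```python
import re


def generalize(terms):
    """Relax a rich query into a plain keyword query (hypothetical helper).

    Case is dropped here because a generic search engine is assumed to be
    case-insensitive; the strict check happens later in `matches`.
    """
    return " ".join(t.lower() for t in terms)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def matches(sentence, first, second, max_gap, case_sensitive=True):
    """Return True if `second` occurs at most `max_gap` tokens after `first`."""
    tokens = re.findall(r"\w+", sentence)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
        first, second = first.lower(), second.lower()
    firsts = [i for i, t in enumerate(tokens) if t == first]
    seconds = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= max_gap for i in firsts for j in seconds)


def extract(page_text, first, second, max_gap, case_sensitive=True):
    """Keep only the sentences of a retrieved page that satisfy the constraints."""
    return [
        s
        for s in split_sentences(page_text)
        if matches(s, first, second, max_gap, case_sensitive)
    ]


# Stand-in for the text of a page returned by the underlying engine.
page = (
    "Polish workers arrived. "
    "They polish the old silver daily. "
    "The Polish ambassador met silver traders."
)
engine_query = generalize(["Polish", "silver"])
strict_hits = extract(page, "Polish", "silver", max_gap=3)
loose_hits = extract(page, "Polish", "silver", max_gap=3, case_sensitive=False)
```

In this sketch the case-sensitive search keeps only the sentence about the ambassador, while the case-insensitive search also keeps the sentence with the verb "polish", which is the kind of distinction a plain search engine query cannot express.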
Similar resources
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Large Linguistically-Processed Web Corpora for Multiple Languages
The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmati...
Linguagrid: a network of Linguistic and Semantic Services for the Italian Language
In order to handle the increasing amount of textual information available on the web today and exploit the knowledge latent in this mass of unstructured data, a wide variety of linguistic knowledge and resources (Language Identification, Morphological Analysis, Entity Extraction, etc.) is crucial. In the last decade LRaas (Language Resource as a Service) emerged as a novel paradigm for publish...
A Corpus-based Approach to Linguistic Function
In this paper, we present our recent experience in constructing a first-of-its-kind functional corpus based on the theoretical framework of Systemic Functional Linguistics. Annotated on selected texts from the Penn Treebank, the corpus was built by a collaborative team on a web-based annotation platform with several advanced features. After a discussion on the background and motivation of the pro...
Analyzing the Sense Distribution of Concordances Obtained by Web as Corpus Approach
In corpus-based lexicography and natural language processing fields some authors have proposed using the Internet as a source of corpora for obtaining concordances of words. Most techniques implemented with this method are based on information retrieval-oriented web searchers. However, rankings of concordances obtained by these search engines are not built according to linguistic criteria but t...
Publication date: 2002